Skip to content

PMM-14661 Fix QAN crashes, make inserts asynchronous#5499

Open
ademidoff wants to merge 4 commits into
mainfrom
PMM-14661-fix-qan-crashes
Open

PMM-14661 Fix QAN crashes, make inserts asynchronous#5499
ademidoff wants to merge 4 commits into
mainfrom
PMM-14661-fix-qan-crashes

Conversation

@ademidoff

@ademidoff ademidoff commented Jun 15, 2026

Copy link
Copy Markdown
Member

Ticket number: PMM-14661

Feature build: SUBMODULES-4406

Summary

Reduce QAN read-path scan cost and stabilize ingestion without changing the schema or result semantics. A larger structural change (a pre-aggregated rollup) seems necessary, but will be reconsidered separately.

Changes

  • Single-scan filter dimensions (reporter.go) — queryDimension uses sumIf(<metric>, <active-dimension predicate>) in one scan, replacing the filtered + "enumerate-with-0" UNION ALL (two scans → one per dimension), roughly halving the filters-panel scans.
  • Asynchronous inserts (default-users.xml, low-memory-users.xml) — enable ClickHouse async_insert on the default profile so metric inserts are buffered server-side and flushed in batches, cutting per-insert overhead and write contention on the QAN tables. We rely on the default wait_for_async_insert=1, per ClickHouse's strong recommendation: the INSERT still blocks until the server flushes, so commit errors are surfaced synchronously (keeping the existing rollback/retry path in data_ingestion.go meaningful) and the server can apply backpressure instead of being overrun. wait_for_async_insert=0 would silence those errors and defeat backpressure, so it is deliberately not used.
  • Ingest buffer left at 100 — the in-process request queue (requestsCap in data_ingestion.go) is intentionally unchanged. It is the backpressure/decoupling mechanism between the gRPC Save handler and the single DB-writer goroutine, not the batching layer, so async_insert does not make it redundant. With wait_for_async_insert=1, slower commits correctly chain into agent-side backpressure; enlarging the queue would only mask that and increase in-memory data loss on restart. Tune via ClickHouse-side settings (e.g. async_insert_busy_timeout_ms) if the qan_api2_data_ingestion_requests_len gauge rides near the cap under load.

Considered and dropped

  • Parallel sparklines — an earlier revision fetched per-row sparklines concurrently (a bounded errgroup) to cut report latency. Reverted: running several sparkline aggregations at once multiplies peak ClickHouse memory during a single report, which works against the OOM/crash fix this PR targets — most acutely on the low-memory profile. It also forced a larger shared connection pool (read load could starve the ingest writer). Sparklines remain serial (one query per row), and maxOpenConns/maxIdleConns stay at their baseline.

@codecov

codecov Bot commented Jun 15, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 43.53%. Comparing base (16ac903) to head (1434e43).
⚠️ Report is 3 commits behind head on v3.

Additional details and impacted files
@@            Coverage Diff             @@
##               v3    #5499      +/-   ##
==========================================
- Coverage   43.54%   43.53%   -0.01%     
==========================================
  Files         413      413              
  Lines       42928    42943      +15     
==========================================
+ Hits        18692    18697       +5     
- Misses      22363    22373      +10     
  Partials     1873     1873              
Flag Coverage Δ
admin 34.78% <ø> (ø)
agent 49.15% <ø> (ø)
managed 42.84% <ø> (-0.01%) ⬇️
vmproxy 72.22% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Bring over the net changes from PR #5443 (branch
PMM-15113-improve-qan-read-queries): parallel sparklines and the
single-scan sumIf filter merge in the QAN reporter/profile paths, plus
the accompanying reporter test and Makefile/AGENTS updates.

The add-then-revert 1-hour metrics rollup churn from that PR's history is
excluded; only its final net diff is applied.
@ademidoff ademidoff marked this pull request as ready for review June 15, 2026 06:12
@ademidoff ademidoff requested a review from a team as a code owner June 15, 2026 06:12
@ademidoff ademidoff requested review from 4nte and JiriCtvrtka and removed request for a team June 15, 2026 06:12
Comment thread build/ansible/roles/clickhouse/files/default-users.xml
Comment thread build/ansible/roles/clickhouse/files/low-memory-users.xml
Comment thread qan-api2/models/base.go Outdated
// run concurrently. It is kept well below the connection pool size (see maxOpenConns
// in main.go) so one request cannot monopolize the pool — which is shared with the
// data ingestion writer — or flood ClickHouse with concurrent scans.
const MaxParallelQueries = 4

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this is maxOpenConns enough? It is 10 and shared with the ingest writer, report Select(), filter queries, and up to 4 parallel sparklines per report. I believe in this case few concurrent QAN users could exhaust the pool and time out.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — this is moot now. I've removed the sparkline parallelization (and MaxParallelQueries) in 1434e43.

Running several sparkline aggregations concurrently multiplies peak ClickHouse memory per report, which works against the OOM/crash fix this PR targets — most acutely on the low-memory profile — and it coupled read concurrency to the shared pool (a burst of reports could starve the single ingest writer). Sparklines are serial again and maxOpenConns/maxIdleConns stay at baseline (10/5), so the pool-exhaustion concern no longer applies.

The report-latency win will be revisited separately, likely via a pre-aggregated rollup rather than more concurrency.

@ademidoff ademidoff changed the title PMM-14661 Fix QAN craches PMM-14661 Fix QAN crashes Jun 16, 2026
@ademidoff ademidoff changed the title PMM-14661 Fix QAN crashes PMM-14661 Fix QAN crashes, make inserts asynchronous Jun 17, 2026
@JiriCtvrtka JiriCtvrtka self-requested a review June 17, 2026 07:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants